fix: retry transient per-object delete failures in removeObjects() cleanup by allanrogerr · Pull Request #1700 · minio/minio-java

allanrogerr · 2026-04-27T13:16:12Z

Problem

The removeObjects() cleanup helper in TestMinioClient.java silently discarded all deletion errors via ignore(r.get()). On transient server-side errors (e.g. SlowDown, InternalError) during a batch delete of 1050 objects, the failing objects remain in the bucket. The subsequent removeBucket() call then returns 409 BucketNotEmpty, causing the test to log a false FAIL from the cleanup path — even when the test logic itself passed.

This failure was observed in 5 distinct CI runs across the week ending 2026-04-24, always in testListObjects with 1050 objects, always as the sole minio-java failure while 13 of 14 other mint suites passed. Tracked in miniohq/eos#4608.

CI evidence (all show the same BucketNotEmpty on listObjects() / 1050 objects):

Repeated error fragment across all logs:

"name":"minio-java","function":"listObjects()","args":"[bucket, prefix, recursive, 1050 objects]",
"status":"FAIL","error":"S3 operation failed; ErrorResponseException{errorResponse=
ErrorResponse{code='BucketNotEmpty', message='The bucket you tried to delete is not empty',
bucketName='minio-java-test-<random>'}}`
``

## Fix

`removeObjects()` now **detects deletion errors and retries transient ones**:

- **Read errors**: Process each `DeleteResult.Error` instead of ignoring via `ignore(r.get())`
- **Non-transient errors** (`!TRANSIENT_DELETE_CODES`): throw `IOException` immediately (preserves actual failures)
- **Transient errors** (`InternalError`, `RequestTimeout`, `ServiceUnavailable`, `SlowDown`): retry the full set up to `MAX_DELETE_ATTEMPTS` (5) times with exponential backoff (1s / 2s / 4s / 8s)
- **Success**: break early if no errors returned

The `testRemoveObjects` wrapper's `finally` block is guarded by a `succeeded` flag so the cleanup is only re-attempted on test failure, preventing spurious timeouts on success.

The removeObjects() cleanup helper silently ignored all DeleteErrors via ignore(r.get()), masking transient server-side failures. When the batch delete of 1050 objects returned partial errors (e.g. SlowDown, InternalError), those objects remained in the bucket and the subsequent removeBucket() call failed with BucketNotEmpty — surfacing a cleanup race as a false test failure. Replace the single ignore() call with two focused helpers: - retryRemoveObject(): retries a single object with exponential backoff (500ms base) up to MAX_DELETE_RETRIES attempts. Accepts NoSuchKey/NoSuchVersion silently (already gone), retries SlowDown/InternalError/RequestTimeout/ServiceUnavailable, and propagates anything else immediately. - removeObjects(): fully consumes the batch-delete iterator first (so all 1000-object HTTP batches are sent), then delegates each DeleteResult.Error to retryRemoveObject() individually. DeleteResult.Error only exposes objectName() via ErrorResponse (no versionId), so a HashMap<String,List<ObjectWriteResponse>> index preserves correct versioned-delete semantics by storing all responses keyed by name and retrying every (name,versionId) pair on a miss.

Switch Result<?> to Result<DeleteResult.Error> to restore compile-time type safety and eliminate the unchecked cast. Drain the full server response before throwing on a non-transient error so OkHttp can reuse the connection slot. Restore the thread interrupt flag if Thread.sleep is interrupted. Wrap TRANSIENT_DELETE_CODES in an unmodifiable set. Guard the testRemoveObjects finally block so the redundant cleanup batch is only sent on failure. Add bucket name to error messages and document the retry-by-name re-queueing trade-off.

Copilot

Pull request overview

Improves functional-test cleanup reliability by adding retry logic for transient per-object failures during multi-object delete, to prevent false BucketNotEmpty failures during post-test teardown.

Changes:

Add transient-error classification and retry constants for multi-object deletion.
Rework removeObjects() to retry deletion based on DeleteResult.Error responses with exponential backoff.
Avoid double-invoking removeObjects() in testRemoveObjects() when the first call succeeds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

MinioAsyncClient.removeObjects() sets completed=true as soon as any 1000-object chunk returns errors, silently dropping all subsequent chunks in that call. When toDelete exceeds 1000 entries and a transient error hits the first chunk, objects in later chunks are never attempted and not carried into the retry set, leaving the bucket non-empty. Fix by iterating toDelete in DELETE_BATCH_SIZE slices and issuing a separate removeObjects() call per slice so each slice is always fully processed regardless of errors in other slices.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

After capturing a non-transient error, the previous code continued issuing removeObjects() calls for all remaining chunks before throwing. Move the nonTransientErr check to after each individual chunk's response is drained so no further delete RPCs are made once a fatal error is seen.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Rename MAX_DELETE_RETRIES to MAX_DELETE_ATTEMPTS to better reflect that the constant bounds attempts, not retries. Include err.message() in the non-transient IOException for richer diagnostics. Track failedNames across iterations and include a 5-object sample in the exhaustion IOException to aid debugging when all retries are consumed.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

If removeObjects() in the try block throws, handleException() rethrows it. A second throw from the finally cleanup would then mask the original exception, making the test failure harder to diagnose. Wrap the cleanup call in try/catch so the original exception propagates unmasked.

The cleanup catch block in the finally clause intentionally discards the exception so the original test failure propagates unmasked. Suppress the DE_MIGHT_IGNORE finding at the method level.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ailure Replace byName index, retryNames set, and toDelete rebuild loop with a single anyTransient boolean flag. Retry the full object list on transient errors rather than narrowing to failing names — already-deleted objects are silently ignored by S3 and MinioAsyncClient filters NoSuchVersion.

allanrogerr · 2026-05-02T15:53:59Z

@balamurugana PTAL

…ailure

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Implements SDK-level automatic retry matching the minio-go retry-mechanism spec. All requests through BaseS3Client.executeAsync() now retry on retryable S3 error codes (InternalError, SlowDown, RequestTimeout, etc.), retryable HTTP status codes (408, 429, 499, 500, 502, 503, 504, 520), and transient IO errors (connection reset, EOF, server closed idle connection). Non-seekable request bodies (raw okhttp3.RequestBody) get a single attempt; seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries (default 10). Backoff is full-jitter exponential: rand(0, min(1s, 200ms*2^n)) before the nth retry, matching minio-go's DefaultRetryUnit/DefaultRetryCap. The "RetryHead" S3 code is explicitly excluded so executeHeadAsync region redirect logic is unaffected. maxRetries is configurable via builder and setMaxRetries() post-construction. Closes minio#1700.

Address review feedback on PR minio#1700: - Use Exception instead of IOException in removeObjects() helper for consistency with the rest of the test class. - Remove the finally/succeeded retry in testRemoveObjects() now that removeObjects() retries transient failures internally; drops the associated DE_MIGHT_IGNORE SpotBugs suppression.

shtripat

lgtm

allanrogerr added 2 commits April 27, 2026 13:01

style: apply Google Java Format (spotless)

30b5c35

allanrogerr self-assigned this Apr 27, 2026

balamurugana reviewed Apr 27, 2026

View reviewed changes

Comment thread functional/TestMinioClient.java Outdated

Comment thread functional/TestMinioClient.java Outdated

allanrogerr requested a review from Copilot April 27, 2026 20:39

Copilot started reviewing on behalf of allanrogerr April 27, 2026 20:40 View session